Needed packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(visdat)

Introduction- Step 1: Formulate your research question

Are certain industries more likely to lack race and/or gender representation than others?

My primary question is whether certain industries are more likely than others to lack representation by race and/or gender. Answering it would likely raise further questions about why such gaps exist. The results could guide additional investigation of the pathways and resources that lead into particular industries for people of different racial and gender backgrounds, and they could prompt questions about representation in other settings. Discrimination and systemic oppression remain ongoing social struggles for both women and ethnic minority communities, and discussing occupations and industries contributes to the conversation about passive discrimination and the need for representation in the workforce. Sharing the question and the data with the general public could raise awareness of the importance and benefits of genuine representation, which is essential for increasing diversity and equal opportunity. A lack of representation could also call for better cultural sensitivity training and accommodations in the workplace, and may lead to advocacy for the programs and training needed to diversify occupations and industries.

Methods- Step 2: Get the Data/read in your data

The instructions from tidytuesday are to read the data in with the tidytuesdayR package (installed from CRAN via install.packages("tidytuesdayR")), which loads the readme and all the datasets for the week of interest. This code is commented out due to issues with rendering the HTML; please un-comment it if needed.

#tuesdata <- tidytuesdayR::tt_load('2021-02-23')
#tuesdata <- tidytuesdayR::tt_load(2021, week = 9)

#employed <- tuesdata$employed

The data can also be read in manually.

employed <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/employed.csv')
## Rows: 8184 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): industry, major_occupation, minor_occupation, race_gender
## dbl (3): industry_total, employ_n, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
earn <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/earn.csv') 
## Rows: 4224 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): sex, race, ethnic_origin, age
## dbl (4): year, quarter, n_persons, median_weekly_earn
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data set is pulled from Tidy Tuesday. The data was generated by the Bureau of Census for the Bureau of Labor Statistics (BLS), where the data can be found. Specifically, the data was collected from “table cpsaat17” over the time period 2015-2020 provided by the BLS. It was collected using the Current Population Survey (CPS) and Current Employment Statistics (CES). These monthly surveys record details of employment, unemployment, hours of work, earnings, people who are not in the labor force, and demographics. The CPS is not limited to gathering information about employment, but in this project it is used for employment, employment-related status, and demographics.

The BLS states that the data is processed and stored on secure servers, with microdata released in public-use files. The Bureau of Census collects the data, performs field editing and coding, checks consistency and quality, and transmits the data to the BLS, where it is edited again and made available to the public. The BLS provides data to “users of BLS data” for a variety of purposes. One of the earliest uses of the data was investigating the effects of tariff legislation, and it has continued to be a source for employment-related research, legislation, and general understanding.

Given these multiple uses, the mutability or immutability of the data seems important. The data is collected through a household survey across the country, which produces a large data set. Change or stagnation in the data could indicate changes, or needed changes, in this population. I believe that the mutability and immutability of the data have been shaped and maintained by different events, and simply by time, within the population.

Using functions in RStudio, I plan to select specific variables such as industry, race, year, and gender. The data downloaded from Tidy Tuesday covers employed persons by industry, sex, race, year, and occupation, as well as weekly median earnings and number of persons employed by race, gender, and age group over time. The variables available in this data set could be consistent with the idea of power relations, because they include only employment and employment-related data. This could set the data up to reflect underrepresentation in certain industries, occupations, employment, and/or weekly median earnings, or it could equally reflect overrepresentation.

Results-Step 3: checking package

skim(employed)
Data summary
Name employed
Number of rows 8184
Number of columns 7
_______________________
Column type frequency:
character 4
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
industry 330 0.96 3 46 0 25 0
major_occupation 0 1.00 19 60 0 5 0
minor_occupation 0 1.00 22 59 0 12 0
race_gender 0 1.00 3 25 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
industry_total 660 0.92 5077105.3 6056215.81 18000 767250 2484000.0 7643000 35894000 ▇▂▁▁▁
employ_n 660 0.92 461551.6 1267564.10 0 9000 65000.0 373000 20263000 ▇▁▁▁▁
year 0 1.00 2017.5 1.71 2015 2016 2017.5 2019 2020 ▇▃▃▃▃

The data includes major and minor occupations in each industry. The industry variable has a completion rate of 0.96, while major_occupation, minor_occupation, race_gender, and year are all complete (1.00). industry_total and employ_n both have a completion rate of 0.92. For the purpose of the research question, we focus on industry and race_gender.

count(employed, industry) 
## # A tibble: 26 × 2
##    industry                          n
##    <chr>                         <int>
##  1 Agriculture and related         396
##  2 Asian                            66
##  3 Black or African American        66
##  4 Construction                    396
##  5 Durable goods                   396
##  6 Education and health services   396
##  7 Financial activities            396
##  8 Information                     396
##  9 Leisure and hospitality         396
## 10 Manufacturing                   396
## # ℹ 16 more rows
count(employed, race_gender)
## # A tibble: 6 × 2
##   race_gender                   n
##   <chr>                     <int>
## 1 Asian                      1254
## 2 Black or African American  1386
## 3 Men                        1386
## 4 TOTAL                      1386
## 5 White                      1386
## 6 Women                      1386

Industries in this data include: Agriculture and related; Construction; Durable goods; Education and health services; Financial activities; Information; Leisure and hospitality; Manufacturing; Mining, quarrying, and oil and gas extraction (listed twice, apparently under two slightly different spellings); Nondurable goods; Other services; Other services, except private households; Private households; Professional and business services; Public administration; Retail trade; Transportation and utilities; Wholesale and retail trade; and Wholesale trade. The industry column also contains values that are not industries: “Asian”, “Black or African American”, “Men”, “White”, and “NA.” The race_gender column consists of: Asian, Black or African American, White, Men, TOTAL, and Women.
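The stray labels in the industry column can be found programmatically rather than by reading the count output. The following is a minimal sketch on a made-up toy table, not the real employed data; the names toy, demographic_labels, and stray are illustrative assumptions.

```r
library(dplyr)

# Toy data (made-up rows, not the real employed table).
toy <- tibble(
  industry    = c("Construction", "Asian", "Retail trade", "White", NA),
  race_gender = c("TOTAL", "Asian", "Women", "White", "Men")
)

# Labels that belong in race_gender, not in industry.
demographic_labels <- unique(toy$race_gender)

# Rows whose industry value is really a demographic label.
stray <- toy |>
  filter(industry %in% demographic_labels)

stray$industry
```

On the real data the same idea would use unique(employed$race_gender) as the lookup; whether every stray value is caught this way is an assumption that should be checked against the full count table.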

summary(employed$industry)
##    Length     Class      Mode 
##      8184 character character
summary(employed$race_gender)
##    Length     Class      Mode 
##      8184 character character
summary(employed$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2015    2016    2018    2018    2019    2020

#Missing values

vis_miss(employed)

97.1% of the values in the “employed” data are present. From here we will continue to work with the “employed” data. Setting cluster = TRUE uses hierarchical clustering to order the rows.

vis_miss(employed, cluster = TRUE) +
  coord_flip()
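The overall percentage reported by vis_miss() can be cross-checked by hand. A minimal base R sketch on a made-up toy data frame (the values and the names toy, pct_present, and n_missing are illustrative, not taken from the real data):

```r
# Toy data frame with some missing cells (made-up values).
toy <- data.frame(
  industry = c("Construction", NA, "Information", "Mining"),
  employ_n = c(1000, NA, NA, 250),
  year     = c(2015, 2016, 2017, 2018)
)

# Share of all cells that are present, as a percentage.
pct_present <- 100 * mean(!is.na(toy))

# Number of missing cells in each column.
n_missing <- colSums(is.na(toy))

pct_present
n_missing
```

Applied to employed, mean(!is.na(employed)) should be comparable to the share shown in the plot, and colSums(is.na(employed)) to the n_missing column in the skim() output.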

Results-Step 4: Look at the top and the bottom of your data

#employed |> head() |> View()

#employed |> tail() |> View()

The head of the data shows Agriculture and related, from 2020, with race_gender “TOTAL.” The tail of the data shows Public administration, from 2015, with race_gender “Asian.” The head and tail views are commented out because of the HTML format; please uncomment them if needed.

Results-Step 5: check your Ns

nrow(employed)
## [1] 8184
min(employed$year, na.rm = TRUE)
## [1] 2015
max(employed$year, na.rm = TRUE)
## [1] 2020

The number of rows is 8184, and the data ranges from 2015 to 2020.

Results-Step 6: Make a plot

employed |>
  ggplot(aes(year)) +
  geom_bar()

table(employed$year)
## 
## 2015 2016 2017 2018 2019 2020 
## 1364 1364 1364 1364 1364 1364
table(employed$year, employed$race_gender)
##       
##        Asian Black or African American Men TOTAL White Women
##   2015   209                       231 231   231   231   231
##   2016   209                       231 231   231   231   231
##   2017   209                       231 231   231   231   231
##   2018   209                       231 231   231   231   231
##   2019   209                       231 231   231   231   231
##   2020   209                       231 231   231   231   231

The first code chunk in this section (Step 6) provides a visual, and the second provides a numeric count of rows for every year. The count is the same every year: 1364. The third chunk provides the number of rows for each ‘race_gender’ value in each year, which is also the same every year. From here on, values such as ‘Women’, ‘NA’, ‘White’, ‘Asian’, and ‘Black or African American’ are filtered out of the industry column, but not out of race_gender. Depending on whether a visual focuses on gender or race, race_gender is filtered accordingly.

#Gender

employed |>
    filter(race_gender %in% c("Women", "Men")) |>
    filter(!(industry %in% c("Women", "NA", "White"))) |>
    ggplot(aes(x = year, y = employ_n, fill = race_gender)) +
    geom_bar(stat = "identity") +
    facet_wrap(vars(industry), scales = 'free_y')
## Warning: Removed 132 rows containing missing values (`position_stack()`).

This graph displays women in blue and men in pink, with a separate stacked bar chart for each industry. The charts suggest that “Private households” and “Education and health services” are clearly dominated by women. Industries such as “Agriculture and related”, “Construction”, “Durable goods”, “Manufacturing”, “Mining, quarrying, and oil and gas extraction”, “Nondurable goods”, “Transportation and utilities”, and “Wholesale trade” appear to be dominated by men. The X-axis is year and the Y-axis is employment count. In this code, “Women”, “NA”, and “White” are excluded from industry.
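One caveat about the “NA” entry in these filters: the skim() output reports the missing industry values as true missing values, and %in% compares against the literal string "NA", so those rows are probably not dropped by the filter as written. A minimal base R sketch with a made-up vector (keep_string and keep_explicit are illustrative names):

```r
# Toy vector of industry values (made up); the second entry is a true missing value.
industry <- c("Construction", NA, "Women", "Retail trade")

# Excluding the string "NA" does not drop the missing value.
keep_string <- !(industry %in% c("Women", "NA", "White"))

# Adding an explicit is.na() test does drop it.
keep_explicit <- !(industry %in% c("Women", "White")) & !is.na(industry)

industry[keep_string]
industry[keep_explicit]
```

In a dplyr pipeline the equivalent would be an extra filter(!is.na(industry)) step; adding it to the plots here would change which facets and warnings appear, so it is offered only as a suggestion.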

#Race

employed |>
  filter(race_gender %in% c("Asian", "Black or African American", 'White')) |>
  filter(!(industry %in% c("Women", "NA", "White", 'Asian', "Black or African American"))) |>
  ggplot(aes(x = year, y = employ_n, fill = race_gender)) +
  geom_bar(stat = "identity") +
  facet_wrap(vars(industry), scales = 'free_y')
## Warning: Removed 132 rows containing missing values (`position_stack()`).

This graph shows the three races in each industry, and it implies that White individuals dominate all industries. Each bar chart is an industry, with year on the X-axis and employment count on the Y-axis. In this code, “Women”, “NA”, “White”, “Asian”, and “Black or African American” are excluded from the industry column.

Results-Step 7: switching to plots and lines

#Gender

employed |>
  filter(race_gender %in% c('Women', 'Men')) |>
  filter(!(industry %in% c("Women", "NA", "White"))) |>
  ggplot(aes(x = year, y = employ_n, color = race_gender, group = interaction(industry, race_gender))) +
  geom_line() +
  geom_point(position = position_dodge(width = 0.2), alpha = 0.5) +
  facet_wrap(~industry, scales = 'free_y') +
  labs(title = "Female and Male Employment in Different Industries Over Time",
       x = "Year",
       y = "Employment",
       color = "Gender") +
  theme_minimal()
## Warning: Removed 132 rows containing missing values (`geom_line()`).
## Warning: Removed 132 rows containing missing values (`geom_point()`).

#Race

employed |>
  filter(race_gender %in% c('Asian', 'Black or African American', 'White')) |>
  filter(!(industry %in% c('Women', 'NA', 'White', 'Asian', 'Black or African American'))) |>
  ggplot(aes(x = year, y = employ_n, color = race_gender, group = interaction(industry, race_gender))) +
  geom_line() +
  geom_point(position = position_dodge(width = 0.2), alpha = 0.5) +
  facet_wrap(~industry, scales = 'free_y') +
  labs(title = "Employment by Race in Different Industries Over Time",
       x = "Year",
       y = "Employment",
       color = "Race") +
  theme_minimal()
## Warning: Removed 132 rows containing missing values (`geom_line()`).
## Warning: Removed 132 rows containing missing values (`geom_point()`).

Both line plots show patterns similar to the previous bar charts. The first chunk, which looks at gender, excludes “Women”, “NA”, and “White” from industry. The second chunk, which presents race, excludes ‘Women’, ‘NA’, ‘White’, ‘Asian’, and ‘Black or African American’ from industry.

Results-Step 8: Fitted Regression Curve

#Gender excluding “Education and health services”

ggplot(employed %>% 
         filter(race_gender %in% c('Men', 'Women') & !(industry %in% c('White', 'Women','NA', 'Education and health services'))), 
       aes(x = year, y = employ_n, color = race_gender)) +
  geom_point() +
  geom_smooth(aes(group = industry), method = "lm", se = FALSE, linetype = "dashed") +
  labs(title = "Fitted Regression Curve by Gender and Industry (Excluding 'Education and health services')", x = "Year", y = "Employment") +
  theme_minimal() +
  facet_wrap(~industry)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 132 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 132 rows containing missing values (`geom_point()`).

#Gender only in “Education and health services”

ggplot(employed %>% 
         filter(race_gender %in% c('Men', 'Women') & industry == 'Education and health services'), 
       aes(x = year, y = employ_n, color = race_gender)) +
  geom_point() +
  geom_smooth(aes(group = industry), method = "lm", se = FALSE, linetype = "dashed") +
  labs(title = "Fitted Regression Curve by Gender (Including Only 'Education and health services')", x = "Year", y = "Employment") +
  theme_minimal() +
  facet_wrap(~industry)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

#Race

ggplot(employed %>% 
           filter(race_gender %in% c('Asian', 'Black or African American','White') &
                    !(industry %in% c('White', 'Women', 'NA', 'Asian', 'Black or African American'))), 
         aes(x = year, y = employ_n, color = race_gender)) +
  geom_point() +
  geom_smooth(aes(group = industry), method = "lm", se = FALSE, linetype = "dashed") +
  labs(title = "Fitted Regression Curve by Race and Industry", x = "Year", y = "Employment") +
  theme_minimal() +
  facet_wrap(~industry)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 132 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 132 rows containing missing values (`geom_point()`).

The code above fits regression lines for men and women in each industry. In the first chunk, “Education and health services” is excluded, since women appear to have higher numbers in that industry, along with ‘White’, ‘Women’, and ‘NA’. The second chunk looks only at gender in “Education and health services.” The third chunk, which displays race, excludes ‘White’, ‘Women’, ‘NA’, ‘Asian’, and ‘Black or African American’ from the industry column. All graphs have year on the X-axis and employment count on the Y-axis.
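Because geom_smooth() is given aes(group = industry), each facet gets a single line fitted to all of its points, which is why ggplot reports that the colour aesthetic was dropped. To compare trends between groups, the fit needs to be done once per race_gender value. The sketch below shows that per-group least-squares fit with lm() on made-up numbers; toy and slopes are illustrative names and the counts are not from the real data.

```r
# Toy data (made-up counts): two groups with exactly linear trends.
toy <- data.frame(
  year        = rep(2015:2018, times = 2),
  race_gender = rep(c("Men", "Women"), each = 4),
  employ_n    = c(100, 110, 120, 130,   # Men: +10 per year
                  200, 195, 190, 185)   # Women: -5 per year
)

# Fit employ_n ~ year separately for each group and keep the slope.
slopes <- sapply(
  split(toy, toy$race_gender),
  function(d) coef(lm(employ_n ~ year, data = d))[["year"]]
)

slopes
```

In the plots themselves, one way to get the same effect would be to drop group = industry from geom_smooth() so that the lines follow the colour grouping; this is a suggestion rather than a tested change to the figures above.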

Results-Step 9: Scatterplot

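Below is the code for a scatter plot of ‘Men’ in pink and ‘Women’ in blue. The X-axis is year, the Y-axis displays industry, and the points are colored by gender (men and women). Unlike the previous graphs, there appear to be more women plotted than men; because both groups have the same number of rows in the data (1386 each in the count above), this is probably an effect of overlapping points rather than a difference in the data. ‘Asian’, ‘White’, and ‘Women’ are excluded from industry.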
#Gender

ggplot(employed, aes(x = year, y = as.factor(industry), color = race_gender)) +
  geom_point(data = subset(employed, race_gender %in% c('Men', 'Women') & !industry %in% c('Asian', 'White', 'Women')),
             position = position_jitter(width = 0.3), size = 3, alpha = 0.7) +
  labs(title = "Scatter Plot of Count by Year, Industry, and Race/Gender",
       x = "Year",
       y = "Industry",
       color = "Race/Gender") +
  theme_minimal() 

Below is a scatter plot by race: Asian (pink), Black or African American (green), and White (blue). As in the previous graph, the X-axis is year and the Y-axis shows the different industries. The points appear to be evenly scattered across years for the three races. ‘Asian’, ‘White’, and ‘Women’ are excluded from industry.

#Race

ggplot(employed, aes(x = year, y = as.factor(industry), color = race_gender)) +
  geom_point(data = subset(employed, race_gender %in% c('White', 'Asian', 'Black or African American')
                           & !industry %in% c('Asian', 'White', 'Women')),
             position = position_jitter(width = 0.3), size = 3, alpha = 0.7) +
  labs(title = "Scatter Plot of Count by Year, Industry, and Race/Gender",
       x = "Year",
       y = "Industry",
       color = "Race/Gender") +
  theme_minimal() 

Results-Step 10: Try the easy solution first

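The code below compares Pr(working in education and health services | women) with Pr(education and health services | men), and Pr(working in education and health services | Asian) with Pr(education and health services | Black or African American). Oddly, it suggests that every race and gender has a similar percentage in this industry, which does not reflect the graphs above. The reason is that count() tallies rows rather than people: each race_gender value has 66 rows for this industry, so every share comes out at 17% regardless of the employ_n values.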

employed |>
  filter(industry == 'Education and health services') |> 
  count(race_gender) |>  
  mutate(share = n / sum(n)) |>  
  arrange(desc(share)) |> 
  mutate(share = scales::percent(share, accuracy = 1)) 
## # A tibble: 6 × 3
##   race_gender                   n share
##   <chr>                     <int> <chr>
## 1 Asian                        66 17%  
## 2 Black or African American    66 17%  
## 3 Men                          66 17%  
## 4 TOTAL                        66 17%  
## 5 White                        66 17%  
## 6 Women                        66 17%
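The identical 17% shares arise because count() tallies rows, and every race_gender value has the same number of rows. A share that reflects employment would instead sum employ_n within each group. The sketch below illustrates this on made-up numbers; toy, employed_sum, and shares are illustrative names, and the counts are not from the real data.

```r
library(dplyr)

# Toy data (made-up counts) for one industry.
toy <- tibble(
  industry    = "Education and health services",
  race_gender = c("Men", "Men", "Women", "Women"),
  employ_n    = c(20, 30, 70, 80)
)

# Sum employment per group, then convert the sums to shares.
shares <- toy |>
  group_by(race_gender) |>
  summarise(employed_sum = sum(employ_n, na.rm = TRUE)) |>
  mutate(share = employed_sum / sum(employed_sum)) |>
  arrange(desc(share))

shares
```

On the real data it would probably make sense to restrict race_gender to "Men" and "Women" (or to the three race values) before computing shares, since TOTAL overlaps with the other categories; that choice is an assumption about how the categories relate, not something verified here.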

Conclusion

This exploratory data analysis suggests that White workers and men are more represented than other races and genders. The data is not fully fit for purpose because it does not include all genders and all races. Future research needs to include all genders and races.

References

tidytuesday. (2021, February 23). Retrieved from https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-02-23/readme.md

Labor Force Statistics from the Current Population Survey. Retrieved from https://www.bls.gov/cps/tables.htm#charemp_m